NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Analysis of metagenomic data

https://doi.org/10.1038/s43586-024-00376-6

Liu, Shaopeng; Rodriguez, Judith S; Munteanu, Viorel; Ronkowski, Cynthia; Sharma, Nitesh Kumar; Alser, Mohammed; Andreace, Francesco; Blekhman, Ran; Błaszczyk, Dagmara; Chikhi, Rayan; et al (December 2025, Nature Reviews Methods Primers)

Metagenomics has revolutionized our understanding of microbial communities, offering unprecedented insights into their genetic and functional diversity across Earth’s diverse ecosystems. Beyond their roles as environmental constituents, microbiomes act as symbionts, profoundly influencing the health and function of their host organisms. Given the inherent complexity of these communities and the diverse environments where they reside, the components of a metagenomics study must be carefully tailored to yield accurate results that are representative of the populations of interest. This Primer examines the methodological advancements and current practices that have shaped the field, from initial stages of sample collection and DNA extraction to the advanced bioinformatics tools employed for data analysis, with a particular focus on the profound impact of next-generation sequencing on the scale and accuracy of metagenomics studies. We critically assess the challenges and limitations inherent in metagenomics experimentation, available technologies and computational analysis methods. Beyond technical methodologies, we explore the application of metagenomics across various domains, including human health, agriculture and environmental monitoring. Looking ahead, we advocate for the development of more robust computational frameworks and enhanced interdisciplinary collaborations. This Primer serves as a comprehensive guide for advancing the precision and applicability of metagenomic studies, positioning them to address the complexities of microbial ecology and their broader implications for human health and environmental sustainability.
more » « less
Free, publicly-accessible full text available December 1, 2026
TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles

https://doi.org/10.1093/gigascience/giad008

Sarwal, Varuni; Brito, Jaqueline; Mangul, Serghei; Koslicki, David (December 2022, GigaScience)

Abstract Background Metagenomic taxonomic profiling aims to predict the identity and relative abundance of taxa in a given whole-genome sequencing metagenomic sample. A recent surge in computational methods that aim to accurately estimate taxonomic profiles, called taxonomic profilers, has motivated community-driven efforts to create standardized benchmarking datasets and platforms, standardized taxonomic profile formats, and a benchmarking platform to assess tool performance. While this standardization is essential, there is currently a lack of tools to visualize the standardized output of the many existing taxonomic profilers. Thus, benchmarking studies rely on a single-value metrics to compare performance of tools and compare to benchmarking datasets. This is one of the major problems in analyzing metagenomic profiling data, since single metrics, such as the F1 score, fail to capture the biological differences between the datasets. Findings Here we report the development of TAMPA (Taxonomic metagenome profiling evaluation), a robust and easy-to-use method that allows scientists to easily interpret and interact with taxonomic profiles produced by the many different taxonomic profiler methods beyond the standard metrics used by the scientific community. We demonstrate the unique ability of TAMPA to generate a novel biological hypothesis by highlighting the taxonomic differences between samples otherwise missed by commonly utilized metrics. Conclusion In this study, we show that TAMPA can help visualize the output of taxonomic profilers, enabling biologists to effectively choose the most appropriate profiling method to use on their metagenomics data. TAMPA is available on GitHub, Bioconda, and Galaxy Toolshed at https://github.com/dkoslicki/TAMPA and is released under the MIT license.
more » « less
Full Text Available
CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

https://doi.org/10.1093/bioinformatics/btac237

Liu, Shaopeng; Koslicki, David (June 2022, Bioinformatics)

Abstract MotivationK-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. ResultsWe derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementationA python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary informationSupplementary data are available at Bioinformatics online.
more » « less
The minimizer Jaccard estimator is biased and inconsistent

https://doi.org/10.1093/bioinformatics/btac244

Belbasi, Mahdi; Blanca, Antonio; Harris, Robert_S; Koslicki, David; Medvedev, Paul (June 2022, Bioinformatics)

Abstract MotivationSketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences. ResultsWe show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e. the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool. Availability and implementationScripts to reproduce our experiments are available at https://github.com/medvedevgroup/minimizer-jaccard-estimator/tree/main/reproduce. Supplementary informationSupplementary data are available at Bioinformatics online.
more » « less
PressPurt: network sensitivity to press perturbations under interaction uncertainty

https://doi.org/10.12688/f1000research.52317.1

Koslicki, David; Gibbon, Dana; Novak, Mark (January 2022, F1000Research)

While the use of networks to understand how complex systems respond to perturbations is pervasive across scientific disciplines, the uncertainty associated with estimates of pairwise interaction strengths (edge weights) remains rarely considered. Mischaracterizations of interaction strength can lead to qualitatively incorrect predictions regarding system responses as perturbations propagate through often counteracting direct and indirect effects. Here, we introduce PressPurt , a computational package for identifying the interactions whose strengths must be estimated most accurately in order to produce robust predictions of a network's response to press perturbations. The package provides methods for calculating and visualizing these edge-specific sensitivities (tolerances) when uncertainty is associated to one or more edges according to a variety of different error distributions. The software requires the network to be represented as a numerical (quantitative or qualitative) Jacobian matrix evaluated at stable equilibrium. PressPurt is open source under the MIT license and is available as both a Python package and an R package hosted at https://github.com/dkoslicki/PressPurt and on the CRAN repository https://CRAN.R-project.org/package=PressPurt.
more » « less
Full Text Available
The Statistics of k -mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

https://doi.org/10.1089/cmb.2021.0431

Blanca, Antonio; Harris, Robert S.; Koslicki, David; Medvedev, Paul (February 2022, Journal of Computational Biology)

Full Text Available
Finer Metagenomic Reconstruction via Biodiversity Optimization

https://doi.org/10.1101/2020.01.23.916924

Foucart, Simon; Koslicki, David (October 2020, Advances in Neural Information Processing Systems 33 (NeurIPS 2020))
Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (Ed.)
When analyzing communities of microorganisms from their sequenced DNA, an important task is taxonomic profiling: enumerating the presence and relative abundance of all organisms, or merely of all taxa, contained in the sample. This task can be tackled via compressive-sensing-based approaches, which favor communities featuring the fewest organisms among those consistent with the observed DNA data. Despite their successes, these parsimonious approaches sometimes conflict with biological realism by overlooking organism similarities. Here, we leverage a recently developed notion of biological diversity that simultaneously accounts for organism similarities and retains the optimization strategy underlying compressive-sensing-based approaches. We demonstrate that minimizing biological diversity still produces sparse taxonomic profiles and we experimentally validate superiority to existing compressive-sensing-based approaches. Despite showing that the objective function is almost never convex and often concave, generally yielding NP-hard problems, we exhibit ways of representing organism similarities for which minimizing diversity can be performed via a sequence of linear programs guaranteed to decrease diversity. Better yet, when biological similarity is quantified by k-mer co-occurrence (a popular notion in bioinformatics), minimizing diversity actually reduces to one linear program that can utilize multiple k-mer sizes to enhance performance. In proof-of-concept experiments, we verify that the latter procedure can lead to significant gains when taxonomically profiling a metagenomic sample, both in terms of reconstruction accuracy and computational performance.
more » « less
Full Text Available
Metalign: efficient alignment-based metagenomic profiling via containment min hash

https://doi.org/10.1186/s13059-020-02159-0

LaPierre, Nathan; Alser, Mohammed; Eskin, Eleazar; Koslicki, David; Mangul, Serghei (December 2020, Genome Biology)
null (Ed.)
Abstract Metagenomic profiling, predicting the presence and relative abundances of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate yet computationally infeasible. Here, we present a novel method, Metalign, that performs efficient and accurate alignment-based metagenomic profiling. We use a novel containment min hash approach to pre-filter the reference database prior to alignment and then process both uniquely aligned and multi-aligned reads to produce accurate abundance estimates. In performance evaluations on both real and simulated datasets, Metalign is the only method evaluated that maintained high performance and competitive running time across all datasets.
more » « less
Full Text Available
Improving the usability and comprehensiveness of microbial databases

https://doi.org/10.1186/s12915-020-0756-z

Loeffler, Caitlin; Karlsberg, Aaron; Martin, Lana S.; Eskin, Eleazar; Koslicki, David; Mangul, Serghei (December 2020, BMC Biology)
null (Ed.)
Abstract Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehensiveness, and calling for the establishment of recommendations for reference genome database assembly. Here, we analyze existing fungal and bacterial databases and discuss guidelines for the development of a master reference database that promises to improve the quality and quantity of omics research.
more » « less
Full Text Available
Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit

https://doi.org/10.1038/s41596-020-00480-3

Meyer, Fernando; Lesker, Till-Robin; Koslicki, David; Fritz, Adrian; Gurevich, Alexey; Darling, Aaron E.; Sczyrba, Alexander; Bremges, Andreas; McHardy, Alice C. (April 2021, Nature Protocols)
null (Ed.)
Full Text Available

« Prev Next »

Search for: All records